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Abstract 

Given a symmetric nonnegative matrix A, symmetric nonnegative matrix factorization (sym- 
NMF) is the problem of finding a nonnegative matrix H, usually with much fewer columns than 
A, such that A ps HH^. SymNMF can be used for data analysis and in particular for various 
clustering tasks. In this paper, we propose simple and very efficient coordinate descent schemes 
to solve this problem, and that can handle large and sparse input matrices. The effectiveness of 
our methods is illustrated on synthetic and real-world data sets, and we show that they perform 
favorably compared to recent state-of-the-art methods. 

Keywords, symmetric nonnegative matrix factorization, coordinate descent, completely positive 
matrices. 

1 Introduction 

Nonnegative matrix factorization (NMF) has become a standard technique in data mining by providing 
low-rank decompositions of nonnegative matrices: given a nonnegative matrix X E and an 

integer r < min(m,n), the problem is to find W E and H E R”^'’ such that X ss WH^. In 

many applications, the nonnegativity constraints lead to a sparse and part-based representation, and 
a better interpretability of the factors, e.g., when analyzing images or documents [25j . 

In this paper, we work on a special case of NMF where the input matrix is a symmetric matrix A. 
Usually, the matrix A will be a similarity matrix where the (i, j)th entry is a measure of the similarity 
between the ith and the jih. data points. This is a rather general framework, and the user can decide 
how to generate the matrix A from his data set by selecting an appropriate metric to compare two 
data points. As opposed to NMF, we are interested in a symmetric approximation with the 

factor H being nonnegative-hence symNMF is an NMF variant with W = H. If the data points are 
grouped into clusters, each rank-one factor H{:, j)'^ will ideally correspond to a cluster present 

in the data set. In fact, symNMF has been used successfully in many different settings and was proved 


I 


to compete with standard clustering techniques such as normalized cut, spectral clustering, /c-means 
and spherical A:-means; see [Ml [281 nni iMi [sa iM [33] and the references therein. 

SymNMF also has tight connections with completely positive matrices mm, that is, matrices of 
the form A = , H > 0, which play an important role in combinatorial optimization [Tj- Note that 

the smallest r such that such a factorization exists is called the cp-rank of A. The focus of this paper 
is to provide efficient methods to compute good symmetric and nonnegative low-rank approximations 
with H > 0 of a given nonnegative symmetric matrix A. 

Let us describe our problem more formally. Given a n-by-n symmetric nonnegative matrix A and 
a factorization rank r, symNMF looks for an n-by-r nonnegative matrix H such that A ~ HH^. The 
error between A and its approximation can be measured in different ways but we focus in this 

paper on the Frobenius norm: 

mm FiH) = -\\A-, (1) 

H>Q ^ ^ 4 II ll-P ’ 

which is arguably the most widely used in practice. Applying standard non-linear optimization schemes 
to m, one can only hope to obtain stationary points, since the objective function of ([T]) is highly non- 
convex, and the problem is NP-hard [T2|. For example, two such methods to find approximate solutions 
to ([I]) were proposed in [24] : 

1. The first method is a Newton-like algorithm which exploits some second-order information with¬ 
out the prohibitive cost of the full Newton method. Each iteration of the algorithm has a 
computational complexity of O(n^r) operations. 

2. The second algorithm is an adaptation of the alternating nonnegative least squares (ANLS) 
method for NMF [201121j where the term ||VF — penalizing the difference between the two 
factors in NMF is added to the objective function. That same idea was used in m where 
the author developed two methods to solve this penalized problem but without any available 
implementation or comparison. 

In this paper, we analyze coordinate descent (CD) schemes for ([1]). Our motivation is that the most 
efficient methods for NMF are CD methods; see [iiii27i[iaii7] and the references therein. The reason 
behind the success of CD methods for NMF is twofold: (i) the updates can be written in closed-form 
and are very cheap to compute, and (ii) the interaction between the variables is low because many 
variables are expected to be equal to zero at a stationary point m- 

The paper is organized as follows. In section [2l we focus on the rank-one problem and present the 
general framework to implement an exact CD method for symNMF. The main proposed algorithm is 
described in section [S] Section [H discusses initialization and convergence issues. Section [5] presents 
extensive numerical experiments on synthetic and real data sets, which shows that our CD methods 
perform competitively with recent state-of-the-art techniques for symNMF. 

2 Exact coordinate descent methods for SymNMF 

Exact coordinate descent (CD) techniques are among the most intuitive methods to solve optimization 
problems. At each iteration, all variables are fixed but one, and that variable is updated to its optimal 
value. The update of one variable at a time is often computationally cheap and easy to implement. 
However little interest was given to these methods until recently when CD approaches were shown 
competitive for certain classes of problems; see [32] for a recent survey. In fact, more and more 
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applications are using CD approaches, especially in machine learning when dealing with large-scale 
problems. 

Let us derive the exact cyclic CD method for symNMF. The approximation of the input 

matrix A can be written as the sum of r rank-one symmetric matrices: 

r 

A « (2) 

k=l 


where ^ is the A:th column of H. If we assume that all columns of H are known except for the jth, 
the problem comes down to approximate a residual symmetric matrix with a rank-one nonnegative 


symmetric matrix 


min 




(3) 


where 

r 

rU) = a- (4) 

k=l,k^j 


For this reason and to simplify the presentation, we only consider the rank-one subproblem in the 
following section 12.11 before presenting on the overall procedure in section 12.21 


2.1 Rank-one Symmetric NMF 

Given a n-by-n symmetric matrix P G let us consider the rank-one symNMF problem 

h>o ^ i 11 ^ “ 

where h G M”. If all entries of P are nonnegative, the problem can be solved for example with the 
truncated singular value decomposition; this follows from the Perron-Frobenius and Eckart-Young 
theorems. In our case, the residuals will in general have negative entries-see (j4|)-which makes 
the problem NP-hard in general [2]. The optimality conditions for ([5]) are given by 

h > 0, V/(/i) > 0, and hi Vf{h)i = 0 for all i, (6) 

where Vf{h)i the ith component of the gradient Vf{h). For any 1 < i < n, the exact CD method 
consists in alternatively updating the variables in a cyclic way: 

for i = 1, 2,... , n : hi ^ hf, 

where hf is the optimal value of hi in ([5]) when all other variables are fixed. Let us show how to 
compute hf. We have: 

v/(/i), = hf + ( hf-pAhi- (7) 

'• -V-^ '---^ 

di bi 
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where 


^ - Pa = \\h\\^ - hj - Pa , and (8) 

1=1,l^i 

h = - hiPii = hiPii - h^P-.^i. (9) 

1 = 1,l^i 

If all the variables but hi are fixed, by the complementary slackness condition ([6]), the optimal solntion 
hf will be either 0 or a solution of the equation Vf{h)i = 0, that is, a root of + aix + hi. Since 
the roots of a third-degree polynomial can be computed in closed form, it suffices to first compute 
these roots and then evalnate f{h) at these roots in order to identify the optimal solntion hf. The 
algorithm based on Cardano’s method (see for example [8]) is described as Algorithm [1] and runs in 
0(1) time. Therefore, given that a* and hi are known, h'^ can be computed in 0(1) operations. 


Algorithm \ x = BestPolynomialRoot{a, b) 


1 : 

2 : 

3: 

4: 

5: 

6 : 

7: 

8 : 

9: 

10 : 

11 : 

12 : 

13: 

14: 

15: 

16: 

17: 


INPUT: a G M, 6 G M 

OUTPUT: argmina; ^ ^ + bx snch that x > 0. 

A = 4a3 27b‘^ 



if A < 0 then 

r = 2^\ 


^ _ phaseangle{d) 

^ ~ 3 

z* = 0,y* = 0 

for /c = 0 : 2 do 

z = rcos (0 + ^) 

if z >0 and ^ + bz < y* then 

:tc 

z = z 

y* = ^ + a^ + bz 

end if 
end for 

x = z 

else _ 



19: if z > 0 and ^ + bz < 0 then 

20: X = z 

21: else 

22: X = 0 

23: end if 

24: end if 


The only inputs of Algorithm [T] are the quantities ([8]) and ([9]). However, the variables in ([5]) are 
not independent. When hi is updated to h'l, the partial derivative of the other variables, that is, the 
entries of V f{h), must be updated. For Z G {i + 1, ...,n}, we update: 

ai^ ai +{hff-h‘f and k ^ bi + Pii{hf - hi) . (10) 
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This means that updating one variable will cost 0{n) operations due to the necessary run over the 
coordinates of h for updating the gradient. (Note that we could also simply evaluate the ith entry 
of the gradient when updating hi, which also requires 0{n) operations; see section [3l) Algorithmic 
describes one iteration of CD applied on problem ([5]). In other words, if one wants to find a stationary 
point of problem ([5]), Algorithm [2] should be called until convergence, and this would correspond 
to applying a cyclic coordinate descent method to dH). In lines 3113 the quantities Oj’s and ftj’s are 
precomputed. Because of the product P-.^i needed for every bi, it takes O(n^) time. Then, from line 
[8] to line uni Algorithm [T] is called for every variable and is followed by the updates described by (fTOl) . 
Finally, Algorithm [2] has a computational cost of 0{n?) operations. Note that we cannot expect a 
lower computational cost since computing the gradient (and in particular the product Ph) requires 
Oiin?) operations. 


Algorithm 2 h = rankoneCDSymNMF{P, ho) 
1: INPUT: P G ho G 
2: OUTPUT: h£Wl 
3: h = ho 

4: for i = 1 : n do 
5: ai = \\h\\l - hf - Pa 

6: hi = hiPii h P- i 

7: end for 
8: for i = 1 : n do 

9: hf = BestPolynomialRoot{ai,bi) 

10: for I > i do 

11: ai ^ ai + {hf )^ - h‘f 

12: bi bi -\- Pii{hf — hi) 

13: end for 

14: hi = hf 

15: end for 


2.2 First exact coordinate descent method for SymNMF 

To tackle SymNMF dTJ, we apply Algorithm [2] on every column of H successively, that is, we apply 
Algorithm [2] with h = H{:,j) and P = for j = l,...,r. The procedure is simple to describe, 
see Algorithm 3] which implements the exact cyclic CD method applied to SymNMF. One can easily 
check that Algorithm [3] requires 0{'n?r) operations to update the nr entries of PI once: 

• In step 31 the full residual matrix R = A — HH^ is precomputed where the product 
requires O(rn^) operations. 

• In step [3 the residual matrix R^^'^ can be computed using the fact that R^^^ = R + H- jRp, 
which requires O(n^) operations. 

• In step [3 Algorithm [2] is called, and requires 0{n^) operations. 

• In steplHl the full residual matrix R = — is updated, which requires O(n^) operations. 
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Algorithm 3 H = generalCDSymNMF{A,HQ) 
1: INPUT: A e Hq G 
2: OUTPUT: H G 
3: H = Ho 
4: R = A-HH'^ 

5: while stopping criterion not satisfied do 
6: for j = 1 : r do 

7: ^R + 

8: H- j •(— rankoneCDSymNMF{R^^\ H-j) 

9: r ’^ R^^ - 

10: end for 

11: end while 


Algorithm [3] has some drawbacks. In particular, the heavy computation of the residual matrix R 
is unpractical for large sparse matrices (see below). In the next sections, we show how to tackle these 
issues and propose a more efficient CD method for symNMF, applicable to large sparse matrices. 

3 Improved Implementation of Algorithm [3] 

The algorithm for symNMF developed in the previous section (Algorithm [3]) is unpractical when the 
input matrix A is large and sparse; in the sense that although A can be stored in memory, Algorithm [3] 
will run out of memory for n large. In fact, the residual matrix R with n? entries computed in step 
0] of Algorithm [3] is in general dense (for example if the entries of H are initialized to some positive 
entries-see section H, even if A is sparse. Sparse matrices usually have 0(n) non-zero entries and, 
when n is large, it is unpractical to store O(n^) entries (this is for example typical for document data 
sets where n is of the order of millions). 

In this section we re-implement Algorithm [3] in order to avoid the explicit computation of the 
residual matrix R] see Algorithm 01 While Algorithm [3] runs in O(rn^) operations per iteration and 
requires O(n^) space in memory (whether or not A is sparse). Algorithm 0] runs in 0(r max(A, nr)) 
operations per iteration and requires 0(max(A', nr)) space in memory, where K is the number of 
non-zero entries of A. Hence, 

• When A is dense, K = 0{n‘^) and Algorithm 0] will have the same asymptotic computational 
cost of 0{r'n?) operations per iteration as Algorithm [3l However, it performs better in practice 
because the exact number of operations is smaller. 

• When A is sparse, K = 0{n) and Algorithm 0] runs in 0{r‘^n) operations per iteration, which 
is significantly smaller than Algorithm [3] in 0{rn‘^), so that it will be applicable to very large 
sparse matrices. In fact, in practice, n can be of the order of millions while r is usually smaller 
than a hundred. This will be illustrated in section [5] for some numerical experiments on text 
data sets. 

In the following, we first assume that A is dense when accounting for the computational cost of 
Algorithm 01 Then, we show that the computational cost is significantly reduced when A is sparse. 
Since we want to avoid the computation of the residual Q, reducing the problem into rank-one 
subproblems solved one after the other is not desirable. To evaluate the gradient of the objective 
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function in ([T]) for the (i, j)th entry of H, we need to modify the expressions ([5]) and Q by substituting 
with A - Ek=i,k^j H;kH^k- We have 

S7h.,F{H) = (^\\\A - = Hfj + + kj, 

where 

Uij = - 2Hfj - An, and ( 11 ) 

bij = ( 12 ) 

The quantities aij and bij will no longer be updated during the iterations as in Algorithm El but rather 
computed on the fly before each entry of H is updated. The reason is twofold: 

• it avoids storing two n-by-r matrices, and 

• the updates of the bij^s, as done in (fT0]l . cannot be performed in 0{n) operations without the 
matrix ^ 


However, in order to minimize the computational cost, the following quantities will be precomputed 
and updated during the course of the iterations: 

• for all i and for all j: if the values of and are available, aij can be 

computed in 0(1); see (fTTI) . Moreover, when Hij is updated to its optimal value we only 
need to update and which can also be done in 0(1): 

+ (13) 

\\H,jf^\\H,jf + {H+)^-Hf^. (14) 

Therefore, pre-computing the ||ffj^:|p’s and ||//:j|p’s, which require 0{rn) operations, allows us 
to compute the a^’s in 0(1). 

• The r-by-r matrix H: by maintaining H^H, computing H),j requires 0(r) opera¬ 
tions. Moreover, when the (i,j)th entry of H is updated to updating requires 0(r) 

operations: 


^ - Hij), 


k = l,...,r. (15) 


To compute bij, we also need to perform the product HFA,j-, see (fT^ . This requires 0(n) operations, 
which cannot be avoided and is the most expensive part of the algorithm. 

In summary, by precomputing the quantities and H^H, it is possible to apply one 

iteration of CD over the nr variables in 0{n^r) operations. The computational cost is the same as in 
Algorithm El in the dense case, but no residual matrix is computed; see Algorithmic 

From linelCto line 1101 the precomputations are performed in O(nr^) time where computing 
is the most expensive part. Then the two loops iterate over all the entries to update each variable 
once. Computing bij (in line [T5]l is the bottleneck of the CD scheme as it is the only part in the two 
loops which requires 0{n) time. However, when the matrix A is sparse, the cost of computing HFA,j 
for all i, that is computing H^A, drops to 0{K) where K is the number of nonzero entries in A. 
Taking into account the term Hi^,{H'^H)j^, to compute bij that requires 0{r) operations, we have that 
Algorithm |4] requires 0(r max(AT, nr)) operations per iteration. 
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Algorithm 4 H = cyclicCDSymNMF{A,Ho) 


1: INPUT: A e G 

2: OUTPUT; H G 
3 : H = Ho 
4: for j = 1 : r do 
5 : C, = \\H,,f 

6 : end for 
7: for i = 1 : n do 
8: Li = \\Hi,f 

9 : end for 
10: D = H'^H 

11: while stopping criterion not satisfied do 
12: for J = 1 ; r do 

13 : for i = 1 ; n do 

14 : aij ■<— Cj -\- Li — — An 

15 : bij ^ Hi {D)j, - HlA,^i - Hi - Hi,ai, 

16 : H^ ^ BestPolynomialRoot{aij, bij) 

17 : c,^C, + {Hl)^-Hl 

18 : LiLi + {HI-)"^ - Hi 

19 : D,,^D,,-H,,{Hl-H,j) 

20 : ^'-jj ^ 

21: end for 

22: end for 

23 : end while 






4 Initialization and Convergence 


In this section, we discuss initialization and convergence of Algorithm HI We also provide a small 
modification for Algorithm H] to perform better (especially when random initialization is used). 

Initialization In most previous works, the matrix H is initialized randomly, using the uniform 
distribution in the interval [0,1] for each entry of H [23]. Note that, in practice, to obtain an unbiased 
initial point, the matrix H should be multiplied by a constant (3* such that 

/3* = argmmjjA - {l3Ho){l3Hof\\F 

I l {AHo,Ho) 

\l ^WH^HoWy 

This allows the initial approximation HqHq to be well scaled compared to A. When using such an 
initialization, we observed that using random shuffling of the columns of H before each iteration (that 
is, optimizing the columns of in a different order each time we run Algorithm |4|) performs in general 
much better; see section El 

Remark 1 (Other heuristics to accelerate coordinate descent methods). During the course of our 
research, we have tried several heuristics to accelerate Algorithm^ including three of the most popular 
strategies: 

• Gauss-Southwell strategies. We have updated the variables by ordering them according to some 
criterion (namely, the decrease of the objective function, and the magnitude of the corresponding 
entry of the gradient). 

• Variable selection. Instead of optimizing all variables at each step, we carefully selected a subset 
of the variables to optimize at each iteration (again using a criterion based on the decrease of 
the objective function or the magnitude of the corresponding entry of the gradient). 

• Random shuffling. We have shuffled randomly the order in which the variables are updated in 
each column. This strategy was shown to be superior in several context, although a theoretical 
understanding of this phenomenon remains elusive 

However, these heuristics (and combinations of them) would not improve significantly the effectiveness 
of Algorithm^ hence we do not present them here. 

Random initialization might not seem very reasonable, especially for our CD scheme. In fact, at 
the first step of our CD method, the optimal values of the entries of the first column H- i of H are 
computed sequentially, trying to solve 

r 

^inJlR(^) - with 

k=2 

Hence we are trying to approximate a matrix which is the difference between A and a randomly 
generated matrix X)fc =2 does not really make sense. In fact, we are trying to approximate 

a matrix which is highly perturbed with a randomly generated matrix. 
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Figure 1: Comparison of the basis elements obtained with symNMF on the CBCL data set (r = 49) 
with (left) zero initialization and (right) random initialization. 


It would arguably make more sense to initialize H at zero, so that, when optimizing over the 
entries of H- i at the first step, we only try to approximate the matrix A itself. It turns out that this 
simple strategy allows to obtain a faster initial convergence than the random initialization strategy. 
However, we observe the following: this solution tends to have a very particular structure where the 
first factor is dense and the next ones are sparser. The explanation is that the first factor is given 
more importance since it is optimized first hence it will be close to the best rank-one approximation of 
A, which is in general positive (if A is irreducible, by Perron-Frobenius and Eckart-Young theorems). 
Hence initializing H at zero tends to produce unbalanced factors. However, this might be desirable in 
some cases as the next factors are in general significantly sparser than with random initialization. To 
illustrate this, let us perform the following numerical experiment: we use the CBCL face data set (see 
section [5|) that contains 2429 facial images, 19 by 19 pixels each. Let us construct the nonnegative 
matrix X G ]^36ix2429 .^y]20];-0 0 ach column is a vectorized image. Then, we construct the matrix 
A = XX^ G M^ 61 x 361 contains the similarities between the pixel intensities among the facial 
images. Hence symNMF of A will provide us with a matrix H where each column of H corresponds 
to a ‘cluster’ of pixels sharing some similarities. Figure [T] shows the columns of H obtained (after 
reshaping them as images) with zero initialization (left) and random initialization (right) with r = 49 
as in [25]. We observe that the solutions are very different, although the relative approximation 
error ||H — are similar (6.2% for zero initialization vs. 7.5% for random initialization, 

after 2000 iterations). Depending on the application at hand, one of the two solutions might be 
more desirable: for example, for the CBCL data set, it seems that the solution obtained with zero 
initialization is more easily interpretable as facial features, while with the random initialization it can 
be interpreted as average/mean faces. 

This example also illustrates the sensitivity of AlgorithmSlto initialization: different initializations 
can lead to very different solutions. This is an unavoidable feature for any algorithm trying to find a 
good solution to an NP-hard problem at a relatively low computational cost. 

Finally, we would like to point out that the ability to initialize our algorithm at zero is a very 
nice feature. In fact, since iL = 0 is a (first-order) stationary point of ([I|), this shows that our 
coordinate descent method ean eseape some first-order stationary points, because it uses higher-order 
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information. For example, any gradient-based method cannot be initialized at zero (the gradient is 
0), also the ANLS-based algorithm from |24j cannot escape from zero. 


Convergence By construction, the objective function is nonincreasing under the updates of Algo- 
rithm|4]while it is bounded from below. Moreover, since our initial estimate Hq is initially scaled (I16jl . 
we have ||A — HqHq\\f < ||A||i? and therefore any iterate H of Algorithm [J] satisfies 

||i///^||ir-||A||,. < \\A-HH^\\f < ||A-i/o^o^||F < IIAIIi.. 


Since H >0, we have for all k 

r 

k=l 


which implies that ||F/^:fc ||2 < A/2||A||i? for all k hence all iterates of Algorithm 0] belong in a compact 
set. Therefore, Algorithm [4] generates a converging subsequence (Bolzano-Weierstrass theorem). (Note 
that, even if the initial iterate is not scaled, all iterates belong to a compact set, replacing 2||A||j7’ by 
\\A\\f + \\A-HqHI\\f.) 

Unfortunately, in its current form, it is difficult to prove convergence of our algorithm to a station¬ 
ary point. In fact, to guarantee the convergence of an exact cyclic coordinate method to a stationary 
point, three sufficient conditions are (i) the objective function is continuously differentiable over the 
feasible set, (ii) the sets over which the blocks of variables are updated are compact as well as conve43) 
and (hi) the minimum computed at each iteration for a given block of variables is uniquely attained ; 
see Prop. 2.7.1 in Conditions (i-ii) are met for Algorithm [H Unfortunately, it is not necessarily 

the case that the minimizer of a fourth order polynomial is unique. (Note however that for a randomly 
generated polynomial, this happens with probability 0. We have observed numerically that this in 
fact never happens in our numerical experiments, although there are counter examples.) 

A possible way to obtain convergence is to apply the maximum block improvement (MBI) method, 
that is, at each iteration, only update the variable that leads to the largest decrease of the objective 
function [9] . Although this is theoretically appealing, this makes the algorithm computationally much 
more expensive hence much slower in practice. (A possible fix is to use MBI not for every iteration, 
but every Tth iteration for some fixed T.) 

In all our numerical experiments, we have always observed that the sequence of iterates generated 
by Algorithm 0] converged to a unique limit point. In that case, we can prove that this limit point is 
a stationary point. 

Theorem 1. Let (77(o)) • • •) be a sequence of iterates generated by Algorithm\^ If that sequence 

converges to a unique accumulation point, it is a stationary point of symNMF m- 


Proof. This proof follows similar arguments as the proof of convergence of exact cyclic CD for 
NMF [17]. Let H be the accumulation point of the sequence (i7(o)j • • • )> that is. 


lim H 

k^oo 


[k) 


H. 


Note that, by construction, 

'^An alternative assumption to the condition (ii) under which the same result holds is when the function is monoton- 
ically nonincreasing in the interval from one iterate to the next [?]. 
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Note also that we consider that only one variable has been npdated between and 

Assume H is not a stationary point of ([I]): therefore, there exists (i,j) such that 

• Hij = 0 and VF{H)ij < 0, or 

• Hiy > 0 and VF{H)ij / 0. 

In both cases, since F is smooth, there exists p ^ 0 such that 

F{H + pE^^) = F{H) - e < F{H), 

for some e > 0, where is the matrix of all zeros except at the {i,j)th entry where it is equal to 
one and H + pE^^ > 0. 

Let us define ...) a subsequence of {H(^QyH(yy ...) as follows: is the iterate 

for which the (i, j)th entry is updated to obtain H^nk+i)- Since Algorithmic updates the entries of H 
column by column, we have Uk = (j — l)n + i — 1 + nrk for A: = 0,1,... . 

By continuity of E and the convergence of the sequence Hy^^y there exists K sufficiently large so 
that for all /c > AT: 

< E{H)-^-. (17) 

In fact, the continuity of F implies that for all ^ > 0, there exists d > 0 sufficiently small such that 
11^ — < 5 ^ W{H) — A(L7(„j.))| < It suffices to choose Uk sufficiently large so that 5 is 

sufficiently small (since converges to H) for the value ^ = e/2. 

Let us flip the sign of (fT7|) and add F{H(^^^'^) on both sides to obtain 

> F(F(,^))-F(^) + |. 

By construction of the subsequence, the (z,j)th entry of is updated first (the other entries are 

updated afterward) to obtain which implies that 

< r(ff, „.+!,) < +pE'J) 


hence 


F(E,„.,) - > F(E,„.,)-F(E(„.,+pE«) 

e 

~ 2 ’ 

since F{H) < F{Hy^^^^). We therefore have that for all k > K, 


a contradiction since F is bounded below. □ 

Note that Theorem [1] is useful in practice since it can easily be checked whether Algorithm 0] 
converges to a unique accumulation point, plotting for example the norm between the different iterates. 
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5 Numerical results 


This section shows the effectiveness of Algorithm 0] on several data sets compared to the state-of- 
the-art techniques. It is organized as follows. In section 15.11 we describe the real data sets and, in 
section [5^ the tested symNMF algorithms. In section [531 we describe the settings we use to compare 
the symNMF algorithms. In section [57il we provide and discuss the experimental results. 

5.1 Data sets 

We will use exactly the same data sets as in M- Because of space limitation, we only give the results 
for one value of the factorization rank r, more numerical experiments are available on the arXiv version 
of this paper m- In [13] , authors use four dense data sets and six sparse data sets to compare several 
NMF algorithms. In this section, we use these data sets to generate similarity matrices A on which we 
compare the different symNMF algorithms. Given a nonnegative data set X G we construct 

the symmetric similarity matrix A = X'^X G so that the entries of A are equal to the inner 

products between data points. Table 0] summarizes the dense data sets, corresponding to widely used 
facial images in the data mining community. Table [2] summarizes the characteristics of the different 
sparse data sets, corresponding to document datasets and described in details in [37] . 


Table 1: Image datasets. 


Data 

# pixels 

m 

n 

ORL^ 

112 X 92 

10304 

400 

Umist^ 

112 X 92 

10304 

575 

CBCL^ 

19 X 19 

361 

2429 

Frey^ 

28 X 20 

560 

1965 


^ http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html 
^ http://www.cs.toronto.edu/~roweis/data.html 
http: //cbcl .mit. edu/cbcl/sof tware-datasets/FaceData2 .html 


Table 2: Text mining data sets (sparsity is given as the percentage of zeros). 


Data 

m 

n 

T^tnonzero 

sparsity X 

sparsity X^ X 

classic 

7094 

41681 

223839 

99.92 

99.50 

sports 

8580 

14870 

1091723 

99.14 

84.51 

reviews 

4069 

18483 

758635 

98.99 

84.24 

hitech 

2301 

10080 

331373 

98.57 

80.32 

ohscal 

11162 

11465 

674365 

99.47 

91.58 

lal 

3204 

31472 

484024 

99.52 

95.72 


5.2 Tested symNMF algorithms 

We compare the following algorithms 
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1. (Newton) This is the Newton-like method from |24| . 

2. (ANLS) This is the method based on the ANLS method for NMF adding the penalty | |kT — Fr| 
in the objective function (see Introduction) from [23]. Note that ANLS has the drawback to 
depend on a parameter that is nontrivial to tune, namely, the penalty parameter for the term 
IIVL — in the objective function (we used the default tuning strategy recommended by the 
authors). 

3. (tSVD) This method, recently introduced in |T8|, first computes the rank-r truncated SVD of 
A ~ Aj. = Ur^rUj where Ur contains the first r singular vectors of A and is the r-by-r 
diagonal matrix containing the first r singular values of A on its diagonal. Then, instead of 
solving m, the authors solve a ‘closeby’ optimization problem replacing A with A^ 

min \\Ar — HH'^\\p. 

H>0 

Once the truncated SVD is computed, each iteration of this method is extremely cheap as the 
main computational cost is in a matrix-matrix product BrQ, where Br = Ur'^r and Q is an 
r-by-r rotation matrix, which can be computed in 0{nr‘^) operations. Note also that they use 
the initialization Hq = max(0, i?^) “We flipped the signs of the columns of Ur to maximize the 
£2 norm of the nonnegative part |6]. 

4. (BetaSNMF) This algorithm is presented in [151 Algorithm 4], and is based on multiplicative 
updates (similarly as for the original NMF algorithm proposed by Lee and Seung [26]). Note 
that we have also implemented the multiplicative update rules from [35] (and already derived 
in [28]). However, we do not report the numerical results here because it was outperformed by 
BetaSNMF in all our numerical experiments, an observation already made in m- 

5. (CD-X-Y) This is Algorithm^ X is either ‘Cyclic’ or ‘Shuffle’ and indicates whether the columns 
of H are optimized in a cyclic way or if they are shuffled randomly before each iteration. Y is 
for the initialization: Y is ‘rand’ for random initialization and is ‘0’ for zero initialization; see 
section 0] for more details. Hence, we will compare four variants of Algorithm |4l CD-Cyclic-0, 
CD-Shuffle-0, CD-Cyclic-Rand and CD-Shuffle-Rand. 

Because Algorithm 0] requires to perform many loops (nr at each step), Matlab is not a well- 
suited language. Therefore, we have developed a C implementation, that can be called from 
Matlab (using Mex files). Note that the algorithms above are better suited for Matlab since 
the main computational cost resides in matrix-matrix products, and in solving linear systems of 
equations (for ANLS and Newton). 

Newton and ANLS are both available from http://math.ucla.edu/~dakuang/, while we have 
implemented tSVD and BetaSNMF ourselves. 

For all algorithms using random initializations for the matrix H, we used the same initial matrices. 
Note however that, in all the figures presented in this section, we will display the error after the first 
iteration, which is the reason why the curves do not start at the same value. 
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5.3 Experimental setup 


In order to compare for the average performance of the different algorithms, we denote Cmm the 
smallest error obtained by all algorithms over all initializations, and define 


m 



^min 


^min 


(18) 


where e{t) is the error \ \A — HH'^\ |ir achieved by an algorithm for a given initialization within t seconds 
(and hence e(0) = ||A — HqHqWf where Hq is the initialization). The quantity E{t) is therefore a 
normalized measure of the evolution of the objective function of a given algorithm on a given data set. 

The advantage of this measure is that it separates better the different algorithms, when using 
a log scale, since it goes to zero for the best algorithm (except for algorithms that are initialized 
randomly as we will report the average value of E{t) over several random initializations; see below). 
We would like to stress out that the measure E{t) from (IlSp has to be interpreted with care. In 
fact, an algorithm for which E(t) converges to zero simply means that it is the algorithm able to 
find the best solution among all algorithms (in other words, to identify a region with a better local 
minima). In fact, the different algorithms are initialized with different initial points: in particular, 
tSVD uses an SVD-based initialization. It does not necessarily mean that it converges the fastest: to 
compare (initial) convergence, one has to look at the values E(t) for t small. However, the measure 
E{t) allows to better visualize the different algorithms. For example, displaying the relative error 
||H — allows to compare the initial convergence, but then the errors for all algorithms 

tend to converge at similar values and it is difficult to identify visually which one converges to the 
best solution. 

For the algorithms using random initialization (namely, Newton, ANLS, CD-Cyclic-Rand and CD- 
Shuffle-Rand), we will run the algorithms 10 times and report the average value of E{t). For all data 
sets, we will run each algorithm for 60 seconds. 

All tests are performed using Matlab on a PC Intel CORE 15-4570 CPU @3.2GHz x 4, with 7.7G 
RAM. The codes are available online from https://sites.google.com/site/nicolasgillis/. 

Remark 2 (Computation of the error). Note that to compute \ \A — , one should not compute 

HH^ explicitly (especially if A is sparse) and use instead 

\\A-HH^\\l = \\A\\l-2{A,HH^) + \\HH^\\l 
= \\A\\l-2{AH,H) + \\H^H\\l. 


5.4 Comparison 

We now compare the different symNMF algorithms listed in section [52] according to the measure given 
in (fTSj) on the data sets described in section (521 and on synthetic data sets. 

5.4.1 Real data sets 

We start with the real data sets. 

Dense image data sets Figure [2| displays the results for the dense real data sets. Table [3| gives 
the number of iterations performed by each algorithm within the 60 seconds, and Table 01 the final 
relative error ||A — Hi/^||/||A||i? in percent. 

We observe the following: 
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Table 3; Average number of iterations performed by each algorithm within 60 seconds for the dense 
real data sets. 


r = 60 

ANLS 

Newton 

tSVD 

BetaSNMF 

CD-Cyc.-O 

CD-Shuf.-O 

CD-Cyc.-Rand 

CD-Shuf.-Rand 

ORL 

6599 

2861 

32685 

33561 

2514 

2501 

2261 

2306 

Umist 

5305 

1431 

25982 

18218 

1500 

1496 

1406 

1357 

CBCL 

473 

4 

12109 

1281 

116 

113 

95 

95 

Frey 

705 

5 

15373 

1876 

166 

157 

138 

139 


Table 4: Average relative error in percent (100 * ||A — li?/!|A|li?) of the final solution obtained 

by each algorithm within 60 seconds for the dense real data sets. For algorithms based on random 
initializations, the standard deviation is given. 


r = 60 

ANLS 

Newton 

tSVD 

BetaSNMF 

CD-Cyc.-O 

CD-Shuf.-O 

CD-Cyc.-Rand 

CD-Shuf.-Rand 

ORL 

0.288 ± 4e-3 

0.341 

0.141 

0.143 ± 3e-4 

0.142 

0.143 

0.165 ± 6e-4 

0.141 ± 8e-5 

Umist 

0.718 ± 0.023 

0.365 

0.041 

0.073 ± 7e-4 

0.043 

0.044 

0.108 ± 2e-3 

0.042 ± 2e-4 

CBCL 

0.254 ± 2e-3 

4.52 

0.046 

0.679 ± 3e-3 

0.169 

0.176 

0.751 ± 7e-3 

0.157 ± le-3 

Frey 

0.083 ± 6e-4 

4.88 

0.057 

0.510 ± 2e-3 

0.105 

0.107 

0.765 ± 4e-3 

0.124 ± 2e-3 


• In all cases, tSVD performs best and is able to generate the solution with the smallest objective 
function value among all algorithms. This might be a bit surprising since it works only with 
an approximation of the original data: it appears that for these real dense data sets, this 
approximation can be computed efficiently and allows tSVD to converge extremely fast to a 
very good solution. 

One of the reasons tSVD is so effective is because each iteration is n times cheaper (once the 
truncated SVD is computed) hence it can perform many more iterations; see Tabled Another 
crucial reason is that image data sets can be very well approximated by low-rank matrices (see 
section [5. 4.2 1 for a confirmation of this behavior). Therefore, for images, tSVD is the best method 
to use as it provides a very good solution extremely fast. 

• When it comes to initial convergence, CD-Cyclic-0 and CD-Shuffle-0 perform best: they are able 
to generate very fast a good solution. In all cases, they are the fastest to generate a solution 
at a relative error of 1% of the final solution of tSVD. Moreover, the fact that tSVD does not 
generate any solution as long as the truncated SVD is not computed could be critical for larger 
data sets. For example, for CBCL with n = 2429 and r = 60, the truncated SVD takes about 6 
seconds to compute while, in the mean time, CD-Cyclic-0 and CD-Shuffle-0 generate a solution 
with relative error of 0.3% from the final solution obtained by tSVD after 60 seconds. 

• For these data sets, CD-Cyclic-0 and CD-Shuffle-0 perform exactly the same: for the zero ini¬ 
tialization, it seems that shuffling the columns of H does not play a crucial role. 

• When initialized randomly, we observe that the CD method performs significantly better with 
random shuffling. Moreover, CD-Shuffle-Rand converges initially slower than CD-Shuffle-0 but 
is often able to converge to a better solution; in particular for the ORL and Umistim data sets. 
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CBCL - r=60 Frey - r=60 






Figure 2: Evolution of the measure (I18jl of the different symNMF algorithms on the dense real data 
sets for r = 60. 

• Newton converges slowly, the main reason being that each iteration is very costly, namely 0(n^r) 
operations. 

• ANLS performs relatively well: it never converges initially faster than CD-based approaches but 
is able to generate a better final solution for the Frey data set. 

• BetaSNMF does not perform well on these data sets compared to tSVD and CD methods, 
although performing better than ANLS and 2 out of 4 times better than ANLS. 

• For algorithms based on random initializations, the standard deviation between several runs is 
rather small, illustrating the fact that these algorithms converge to solutions with similar final 
errors. 

Conclusion: for image data sets, tSVD performs the best. However, CD-Cyclic-0 allows a very 
fast initial convergence and can be used to obtain very quickly a good solution. 


17 


















Sparse document data sets Figure [3] displays the results for the real sparse data sets. Table [S] 
gives the number of iterations performed by each algorithm within the 60 seconds, and Table [6] the 
final relative error ||^ — FfFf^||/||j4||i? in percent. 

For some data sets (namely, lal and reviews), computing the truncated SVD of A was not possible 
with Matlab within 60 seconds hence tSVD was not able to return any solution; see Remark [3] for a 
discussion. Moreover, Newton is not displayed because it is not designed for sparse matrices and runs 
out of memory [24j . 


Table 5: Average number of iterations performed by each algorithm within 60 seconds for the sparse 
real data sets. 


r = 30 

ANLS 

tSVD 

BetaSNMF 

CD-Cyc.-O 

CD-Shuf.-O 

CD-Cyc.-Rand 

CD-Shuf.-Rand 

classic 

41.3 

1212 

237 

44 

44 

44 

44 

sports 

34 

4330 

66 

23 

23 

23 

23 

reviews 

16.7 

0 

41 

13 

13 

13 

13 

hitech 

61 

8334 

115 

37 

37 

37 

37 

ohscal 

91.7 

7855 

199 

61 

61 

61 

61 

lal 

16 

0 

43 

15 

15 

15 

15 


Table 6: Average relative error in percent (100 * ||A — |i7’/||A||i7’) of the final solution obtained 

by each algorithm within 60 seconds for the sparse real data sets. For algorithms based on random 
initializations, the standard deviation is given. 


r = 30 

ANLS 

tSVD 

BetaSNMF 

CD-Cyc.-O 

CD-Shuf.-O 

CD-Cyc.-Rand 

CD-Shuf.-Rand 

classic 

99.99 ± le-4 

39.8 

38.1 ± 0.14 

37.6 

37.8 

37.6 ± 0.09 

37.7 ± 0.09 

sports 

99.9 ± le-3 

19.2 

20.1 ± 0.28 

17.5 

17.7 

17.5 ± 0.11 

17.7 ± 0.10 

reviews 

99.9 ± 7e-4 

/ 

20.0 ± 0.56 

16.3 

16.4 

16.3 ± 0.10 

16.3 ± 0.08 

hitech 

99.5 ± 4e-3 

33.3 

31.3 ± 0.22 

30.5 

30.5 

30.4 ± 0.09 

30.4 ± 0.08 

ohscal 

99.95 ± le-3 

22.2 

21.6 ± 0.11 

20.9 

21.0 

20.9 ± 0.04 

20.9 ± 0.04 

lal 

99.9 ± 8e-4 

/ 

34.9 ± 0.32 

31.9 

32.0 

32.1 ± 0.10 

32.0 ± 0.10 


We observe the following: 

• tSVD performs very poorly. The reason is twofold: (1) the truncated SVD is very expensive to 
compute and (2) sparse matrices are usually not close to being low-rank hence tSVD converges 
to a very poor solution (see section fS. 4. 21 for a confirmation of this behavior). 

• ANLS performs very poorly and is not able to generate a good solution. In fact, it has difficulties 
to decrease the objective function (on the figures, it seems it does not decrease, but it actually 
decreases very slowly). 

• BetaSNMF performs better than ANLS but does not compete with CD methods. (Note that, 
for the classic data sets, BetaSNMF was stopped prematurely because there was a division by 
zero which could have been avoided but we have strictly used the description of Algorithm 4 
in [T5]i. 
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Figure 3: Evolution of the measure (jl8h of the different symNMF algorithms on real sparse data sets 
for r = 30. 


19 




























• All CD-based approaches are very effective and perform similarly. It seems that, in these cases, 
nor the initialization nor the order in which the columns of H are updated plays a significant 
role. 

However, we observe that in all cases, E{t) converges to a value between 10“^ and 10“^, never to 
a smaller value. This means that the best solution is always obtained with random initialization 
(in fact, for algorithms initialized randomly. Figure [3] reports the average over 10 runs) but, on 
average, random initialization performs similarly as the initialization with zero. 

Conclusion: for sparse document data sets, CD-based approaches outperform significantly the 
other tested methods. 

Remark 3 (SVD computation in tSVD). It has to he noted that, in our numerical experiments, the 
matrix A is constructed using the formula A = X'^X, where the columns of the matrix X are the 
data points. In other words, we use the simple similarity measure y'^z between two data points y and 
z. In that case, the SVD of A can be obtained from the SVD of X, hence can be made (i) more 
efficient (when X has more columns than rows, that is, m <^n), and (ii) numerically more accurate 
(because the condition number of X^X is equal to the square of that of X); see, e.g., Lecture 31]. 
Moreover, in case of sparse data, this avoids the fill-in, as observed in Table [H where X'^X is denser 
than X. Therefore, in this particular situation when A = X^X and X is sparse and/or m <^n, it is 
much better to compute the SVD of A based on the SVD of X. Table^l^gives the computational time in 
both cases. In this particular scenario, it would make sense to use tSVD as an initialization procedure 


Table 7: Computational time required to compute the rank-30 truncated SVD of X and X'^X using 
Matlab. _ 


svds(.,30) 

classic 

hitech 

lal 

ohscal 

reviews 

sports 

X'*X 

17.14 

18.54 

63.33 

15 

67.32 

31.77 

X 

5.55 

0.82 

3.08 

2.87 

1.39 

2.98 


for CD methods to obtain rapidly a good initial iterate. However, looking at Figure 0 and Table 0 
indicates that this would not necessarily be advantageous for the CD-based methods in all cases. For 
example, for the classic data set, tSVD would achieve a relative error of 39.8% within about 6 seconds 
while CD methods obtain a similar relative error within that computing time. For the hitech data set 
however, this would be rather helpful since tSVD would only take about 1 second to obtain a relative 
error of 33.3% while CD methods require about 9 seconds to do so. 

However, the goal of this paper is to provide an efficient algorithm for the general symNMF problem, 
without assuming any particular structure on the matrix A (in practice the similarity measure between 
data points is usually not simply their inner product). . Therefore, we have not assumed that the 
matrix A had this particular structure and only provide numerical comparison in that case. 

Remark 4 (Low-rank models for full-rank matrices). Although sparse data sets are usually not low 
rank, it still makes sense to try to find a low-rank structure that is close to a given data set, as this often 
allows to extract some pertinent information. In particular, in document classification and clustering, 
low-rank models have proven to be extremely useful; see the discussion in the Introduction and the 
references therein. Another important application where low-rank models have proven extremely useful 
although the data sets are usually not low-rank is recommender systems \2^ . We also refer the reader 
to the recent survey W. 
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Figure 4: Evolution of the measure (jlSp of the different symNMF algorithms on dense and low-rank 
synthetic data sets for r = 30 (left) and r = 60 (right). 

5.4.2 Synthetic data sets: low-rank vs. full rank matrices 

In this section, we perform some numerical experiments on synthetic data sets. Our main motivation is 
to confirm the (expected) behavior observed on real data: tSVD performs extremely well for low-rank 
matrices and poorly on full-rank matrices. 

Low-rank input matrices The most natural way to generate nonnegative symmetric matrices of 
given cp-rank is to generate randomly and then compute A = In this section, we use the 

Matlab function = rand(n, r) with n = 500 and r = 30, 60, that is, each entry of is generated 
uniformly at random in the interval [0,1]. We have generated 10 such matrices for each rank, and 
Figure m displays the average value for the measure (fTSl) but we use here Cmin = 0 since it is the known 
optimal value. 

We observe that, in all cases, tSVD outperforms all methods. Moreover, it seems that the SVD- 
based initialization is very effective. The reason is that A has exactly rank r and hence its best rank-r 
approximation is exact. Moreover, tSVD only works in the correct subspace in which belongs 
hence converges much faster than the other methods. 

Except for Newton, the other algorithms perform similarly. It is worth noting that the same 
behavior we observed for real dense data sets is present here: CD-Shuffle-Rand performs better than 
CD-Cyclic-Rand, while shuffling the columns of H before each iteration does not play a crucial role 
with the zero initialization. 

Pull-rank input matrices A simple way to generate nonnegative symmetric matrices of full rank 
is to generate a matrix B randomly and then compute A = B + B^. In this section, we use the Matlab 
function B = rand(n) with n = 500. We have generated 10 such matrices for each rank, and Figure [5] 
displays the average value for the measure E{t) from (fT8]l . Figure [5] displays the results. 

We observe that, in all cases, tSVD performs extremely poorly while all other methods (except for 
Newton and BetaSNMF) perform similarly. The reason is that tSVD works only with the best rank-r 
approximation of A, which is poor when A has full rank. 


21 















dense and full-rank synthetic - r=30 
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Figure 5: Evolution of the measure m of the different symNMF algorithms on dense full-rank 
synthetic data sets for r = 30 (left) and r = 60 (right). 

5.4.3 Summary of results 

Clearly, tSVD and CD-based approaches are the most effective, although ANLS sometimes performs 
competitively for the dense data sets. However, tSVD performs extremely well only when the input 
matrix is low rank (cf. low-rank synthetic data sets) or close to being low rank (cf. image data sets). 
There are three cases when it performs very poorly: 

• It cannot perform a symNMF when the factorization rank r is larger than the rank of A, that is, 
when r > rank(A), which may be necessary for matrices with high cp-rank (in fact, the cp-rank 
can be much higher than the rank [3]). 

• If the truncated SVD is a poor approximation of A, the algorithm will perform poorly since it 
does not use any other information; see the results for the full rank synthetic data sets and the 
sparse real data sets. 

• The algorithm returns no solution as long as the SVD is not computed. In some cases, the cost 
of computing the truncated SVD is high and tSVD terminates before any solution to symNMF 
is produced; see the sparse real data sets. 

To conclude, CD-based approaches are overall the most reliable and most effective methods to 
solve symNMF (HD- For dense data sets, initialization at zero allows a faster initial convergence, while 
CD-Shuffle-Rand generates in average the best solution and CD-Cyclic-Rand does not perform well 
and is not recommended. For sparse data sets, all CD variants perform similarly and outperform the 
other tested algorithms. 

6 Conclusion and further research 

In this paper, we have proposed very efficient exact coordinate descent methods for symNMF ([T]) that 
performs competitively with state-of-the-art methods. 

Some interesting directions for further research are the following: 
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• The study of sparse symNMF, where one is looking for a sparser matrix H. A natural model 
would for example use the sparsity-inducing ii norm and try to solve 

min ^\\A - HH^Wl + ^Aj\\H.J\i , (19) 

i=i 

for some penalty parameter A G Algorithm [3] can be easily adapted to handle (1191) . by 
replacing the bij's with bij + Aj. In fact, the derivative of the penalty term only influences the 
constant part in the gradient; see (I12p . However, it seems the solutions of (I19p are very sensitive 
to the parameter A and hence are difficult to tune. Note that another way to identify sparser 
factors is simply to increase the factorization rank r, or to sparsify the input matrix A (only 
keeping the important edges in the graph induced by A; see [T] and the references therein) -in 
fact, a sparser matrix A induces sparser factors since 

Aij = 0 ^ Hi -HJ. ps 0 ^ ~ 0 or Hji^ ~ OVA:. 

This is an interesting observation: Aij = 0 implies a (soft) orthogonality constraints on the rows 
of H. This is rather natural: if item i does not share any similarity with item j {Aij = 0), then 
they should be assigned to different clusters {Hik ~ 0 or Hjk ~ 0 for all k). 

• The design of more efficient algorithms for symNMF. For example, a promising direction would 
be to combine the idea from |18j that use a compressed version of A with very cheap per-iteration 
cost with our more reliable CD method, to combine the best of both worlds. 
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